Diabetes Prediction¶

1. Introduction¶

  • Problem statement # 2. Import libraries # 3. Basic Exploration
  • Read dataset
  • Some information
  • Data visualization # 4. Data preprocessing # 5. Machine Learning model
  • Logistic Regression
  • Random Forest Classifier
  • Decision Tree Classifier
  • KNeighborsClassifier Model
  • Gradient Boosting Classifier
  • Support Vector Classifier
  • xgboost Classifier
  • Hyperparameter Tuning of Random Forest
  • Hyperparameter Tuning of Decision Tree
  • Hyperparameter Tuning of KNN

6. Model Deployment¶

7. Conclusion¶

  • Conclusion
  • Future Possible Work

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------¶

1.1 | Problem statement¶

The objective of this dataset is to build a predictive model for diagnosing diabetes in female patients who are at least 21 years old and of Pima Indian heritage. The model should predict whether a patient has diabetes (Outcome = 1) or does not have diabetes (Outcome = 0) based on several diagnostic measurements, including pregnancies, glucose level, blood pressure,skin thickness, insulin level, BMI, diabetes pedigree function, and age.

In [1]:
#import libraries
In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, mean_squared_error, r2_score, roc_auc_score, roc_curve, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import KFold
import warnings
warnings.simplefilter(action='ignore')
sns.set()
plt.style.use("ggplot")
%matplotlib inline
In [3]:
!pip install missingno
Requirement already satisfied: missingno in c:\users\samar\anaconda3\lib\site-packages (0.5.2)
Requirement already satisfied: numpy in c:\users\samar\anaconda3\lib\site-packages (from missingno) (1.24.3)
Requirement already satisfied: matplotlib in c:\users\samar\anaconda3\lib\site-packages (from missingno) (3.7.2)
Requirement already satisfied: scipy in c:\users\samar\anaconda3\lib\site-packages (from missingno) (1.11.1)
Requirement already satisfied: seaborn in c:\users\samar\anaconda3\lib\site-packages (from missingno) (0.12.2)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\samar\anaconda3\lib\site-packages (from matplotlib->missingno) (1.0.5)
Requirement already satisfied: cycler>=0.10 in c:\users\samar\anaconda3\lib\site-packages (from matplotlib->missingno) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\samar\anaconda3\lib\site-packages (from matplotlib->missingno) (4.25.0)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\samar\anaconda3\lib\site-packages (from matplotlib->missingno) (1.4.4)
Requirement already satisfied: packaging>=20.0 in c:\users\samar\anaconda3\lib\site-packages (from matplotlib->missingno) (23.1)
Requirement already satisfied: pillow>=6.2.0 in c:\users\samar\anaconda3\lib\site-packages (from matplotlib->missingno) (9.4.0)
Requirement already satisfied: pyparsing<3.1,>=2.3.1 in c:\users\samar\anaconda3\lib\site-packages (from matplotlib->missingno) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\samar\anaconda3\lib\site-packages (from matplotlib->missingno) (2.8.2)
Requirement already satisfied: pandas>=0.25 in c:\users\samar\anaconda3\lib\site-packages (from seaborn->missingno) (2.0.3)
Requirement already satisfied: pytz>=2020.1 in c:\users\samar\anaconda3\lib\site-packages (from pandas>=0.25->seaborn->missingno) (2023.3.post1)
Requirement already satisfied: tzdata>=2022.1 in c:\users\samar\anaconda3\lib\site-packages (from pandas>=0.25->seaborn->missingno) (2023.3)
Requirement already satisfied: six>=1.5 in c:\users\samar\anaconda3\lib\site-packages (from python-dateutil>=2.7->matplotlib->missingno) (1.16.0)
In [148]:
!pip install scikit-learn
Requirement already satisfied: scikit-learn in c:\users\samar\anaconda3\lib\site-packages (1.4.2)
Requirement already satisfied: numpy>=1.19.5 in c:\users\samar\anaconda3\lib\site-packages (from scikit-learn) (1.24.3)
Requirement already satisfied: scipy>=1.6.0 in c:\users\samar\anaconda3\lib\site-packages (from scikit-learn) (1.11.1)
Requirement already satisfied: joblib>=1.2.0 in c:\users\samar\anaconda3\lib\site-packages (from scikit-learn) (1.2.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\samar\anaconda3\lib\site-packages (from scikit-learn) (2.2.0)
In [147]:
!pip install -U scikit-learn
Requirement already satisfied: scikit-learn in c:\users\samar\anaconda3\lib\site-packages (1.4.2)
Requirement already satisfied: numpy>=1.19.5 in c:\users\samar\anaconda3\lib\site-packages (from scikit-learn) (1.24.3)
Requirement already satisfied: scipy>=1.6.0 in c:\users\samar\anaconda3\lib\site-packages (from scikit-learn) (1.11.1)
Requirement already satisfied: joblib>=1.2.0 in c:\users\samar\anaconda3\lib\site-packages (from scikit-learn) (1.2.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\samar\anaconda3\lib\site-packages (from scikit-learn) (2.2.0)
In [122]:
! pip install streamlit
Requirement already satisfied: streamlit in c:\users\samar\anaconda3\lib\site-packages (1.34.0)
Requirement already satisfied: altair<6,>=4.0 in c:\users\samar\anaconda3\lib\site-packages (from streamlit) (5.3.0)
Requirement already satisfied: blinker<2,>=1.0.0 in c:\users\samar\anaconda3\lib\site-packages (from streamlit) (1.8.2)
Requirement already satisfied: cachetools<6,>=4.0 in c:\users\samar\anaconda3\lib\site-packages (from streamlit) (5.3.3)
Requirement already satisfied: click<9,>=7.0 in c:\users\samar\anaconda3\lib\site-packages (from streamlit) (8.0.4)
Requirement already satisfied: numpy<2,>=1.19.3 in c:\users\samar\anaconda3\lib\site-packages (from streamlit) (1.24.3)
Requirement already satisfied: packaging<25,>=16.8 in c:\users\samar\anaconda3\lib\site-packages (from streamlit) (23.1)
Requirement already satisfied: pandas<3,>=1.3.0 in c:\users\samar\anaconda3\lib\site-packages (from streamlit) (2.0.3)
Requirement already satisfied: pillow<11,>=7.1.0 in c:\users\samar\anaconda3\lib\site-packages (from streamlit) (9.4.0)
Requirement already satisfied: protobuf<5,>=3.20 in c:\users\samar\anaconda3\lib\site-packages (from streamlit) (4.25.3)
Requirement already satisfied: pyarrow>=7.0 in c:\users\samar\anaconda3\lib\site-packages (from streamlit) (11.0.0)
Requirement already satisfied: requests<3,>=2.27 in c:\users\samar\anaconda3\lib\site-packages (from streamlit) (2.31.0)
Requirement already satisfied: rich<14,>=10.14.0 in c:\users\samar\anaconda3\lib\site-packages (from streamlit) (13.7.1)
Requirement already satisfied: tenacity<9,>=8.1.0 in c:\users\samar\anaconda3\lib\site-packages (from streamlit) (8.2.2)
Requirement already satisfied: toml<2,>=0.10.1 in c:\users\samar\anaconda3\lib\site-packages (from streamlit) (0.10.2)
Requirement already satisfied: typing-extensions<5,>=4.3.0 in c:\users\samar\anaconda3\lib\site-packages (from streamlit) (4.7.1)
Requirement already satisfied: gitpython!=3.1.19,<4,>=3.0.7 in c:\users\samar\anaconda3\lib\site-packages (from streamlit) (3.1.43)
Requirement already satisfied: pydeck<1,>=0.8.0b4 in c:\users\samar\anaconda3\lib\site-packages (from streamlit) (0.9.0)
Requirement already satisfied: tornado<7,>=6.0.3 in c:\users\samar\anaconda3\lib\site-packages (from streamlit) (6.3.2)
Requirement already satisfied: watchdog>=2.1.5 in c:\users\samar\anaconda3\lib\site-packages (from streamlit) (2.1.6)
Requirement already satisfied: jinja2 in c:\users\samar\anaconda3\lib\site-packages (from altair<6,>=4.0->streamlit) (3.1.2)
Requirement already satisfied: jsonschema>=3.0 in c:\users\samar\anaconda3\lib\site-packages (from altair<6,>=4.0->streamlit) (4.17.3)
Requirement already satisfied: toolz in c:\users\samar\anaconda3\lib\site-packages (from altair<6,>=4.0->streamlit) (0.12.0)
Requirement already satisfied: colorama in c:\users\samar\anaconda3\lib\site-packages (from click<9,>=7.0->streamlit) (0.4.6)
Requirement already satisfied: gitdb<5,>=4.0.1 in c:\users\samar\anaconda3\lib\site-packages (from gitpython!=3.1.19,<4,>=3.0.7->streamlit) (4.0.11)
Requirement already satisfied: python-dateutil>=2.8.2 in c:\users\samar\anaconda3\lib\site-packages (from pandas<3,>=1.3.0->streamlit) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\users\samar\anaconda3\lib\site-packages (from pandas<3,>=1.3.0->streamlit) (2023.3.post1)
Requirement already satisfied: tzdata>=2022.1 in c:\users\samar\anaconda3\lib\site-packages (from pandas<3,>=1.3.0->streamlit) (2023.3)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\samar\anaconda3\lib\site-packages (from requests<3,>=2.27->streamlit) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in c:\users\samar\anaconda3\lib\site-packages (from requests<3,>=2.27->streamlit) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\samar\anaconda3\lib\site-packages (from requests<3,>=2.27->streamlit) (1.26.16)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\samar\anaconda3\lib\site-packages (from requests<3,>=2.27->streamlit) (2023.7.22)
Requirement already satisfied: markdown-it-py>=2.2.0 in c:\users\samar\anaconda3\lib\site-packages (from rich<14,>=10.14.0->streamlit) (2.2.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in c:\users\samar\anaconda3\lib\site-packages (from rich<14,>=10.14.0->streamlit) (2.15.1)
Requirement already satisfied: smmap<6,>=3.0.1 in c:\users\samar\anaconda3\lib\site-packages (from gitdb<5,>=4.0.1->gitpython!=3.1.19,<4,>=3.0.7->streamlit) (5.0.1)
Requirement already satisfied: MarkupSafe>=2.0 in c:\users\samar\anaconda3\lib\site-packages (from jinja2->altair<6,>=4.0->streamlit) (2.1.1)
Requirement already satisfied: attrs>=17.4.0 in c:\users\samar\anaconda3\lib\site-packages (from jsonschema>=3.0->altair<6,>=4.0->streamlit) (22.1.0)
Requirement already satisfied: pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0 in c:\users\samar\anaconda3\lib\site-packages (from jsonschema>=3.0->altair<6,>=4.0->streamlit) (0.18.0)
Requirement already satisfied: mdurl~=0.1 in c:\users\samar\anaconda3\lib\site-packages (from markdown-it-py>=2.2.0->rich<14,>=10.14.0->streamlit) (0.1.0)
Requirement already satisfied: six>=1.5 in c:\users\samar\anaconda3\lib\site-packages (from python-dateutil>=2.8.2->pandas<3,>=1.3.0->streamlit) (1.16.0)
In [4]:
#Basic exploration
In [5]:
# read the dataset from dir
df = pd.read_csv("diabetes.csv")
In [6]:
df.head()
Out[6]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
In [8]:
df.columns
Out[8]:
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')
In [9]:
# descriptive statistics of the dataset
df.describe()
Out[9]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 0.471876 33.240885 0.348958
std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160 0.331329 11.760232 0.476951
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.078000 21.000000 0.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 0.243750 24.000000 0.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.372500 29.000000 0.000000
75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.000000 1.000000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000
In [10]:
# (row, columns)
df.shape
Out[10]:
(768, 9)
In [11]:
# distribution of outcome variable
df.Outcome.value_counts()*100/len(df)
Out[11]:
Outcome
0    65.104167
1    34.895833
Name: count, dtype: float64
In [12]:
df['Outcome'].value_counts()*100/len(df)
Out[12]:
Outcome
0    65.104167
1    34.895833
Name: count, dtype: float64
In [13]:
#histogram to understand the distribution
import warnings
warnings.filterwarnings("ignore")
for i in df.select_dtypes(include="number").columns:
    sns.histplot(data=df,x=i)
    plt.show()
In [14]:
# plot the hist of the age variable
plt.figure(figsize=(8,7))
plt.xlabel('Age', fontsize=10)
plt.ylabel('Count', fontsize=10)
df['Age'].hist(edgecolor="black")
Out[14]:
<Axes: xlabel='Age', ylabel='Count'>
In [15]:
df['Age'].max()
Out[15]:
81
In [16]:
df['Age'].min()
Out[16]:
21
In [17]:
print("MAX AGE: "+str(df['Age'].max()))
print("MIN AGE: "+str(df['Age'].min()))
MAX AGE: 81
MIN AGE: 21
In [18]:
df.columns
Out[18]:
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')
In [19]:
# density graph
# 4*2=8
# columns=2 figure
# having 4 row

# [0,0], [0,1]
# [1,0], [1,1]
# [2,0], [2,1]
# [3,0], [3,1]

fig,ax = plt.subplots(4,2, figsize=(20,20))
sns.distplot(df.Pregnancies, bins=20, ax=ax[0,0], color="red")
sns.distplot(df.Glucose, bins=20, ax=ax[0,1], color="red")
sns.distplot(df.BloodPressure, bins=20, ax=ax[1,0], color="red")
sns.distplot(df.SkinThickness, bins=20, ax=ax[1,1], color="red")
sns.distplot(df.Insulin, bins=20, ax=ax[2,0], color="red")
sns.distplot(df.BMI, bins=20, ax=ax[2,1], color="red")
sns.distplot(df.DiabetesPedigreeFunction, bins=20, ax=ax[3,0], color="red")
sns.distplot(df.Age, bins=20, ax=ax[3,1], color="red")
Out[19]:
<Axes: xlabel='Age', ylabel='Density'>
In [20]:
plt.figure(figsize=(20,6))
plt.subplot(1,3,1)
plt.title("Counter Plot")
sns.countplot(x = 'Pregnancies', data = df)

plt.subplot(1,3,2)
plt.title('Distribution Plot')
sns.distplot(df["Pregnancies"])

plt.subplot(1,3,3)
plt.title('Box Plot')
sns.boxplot(y=df["Pregnancies"])

plt.show()
In [21]:
df.columns
Out[21]:
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')
In [22]:
df.groupby("Outcome").agg({'Pregnancies':'mean'})
Out[22]:
Pregnancies
Outcome
0 3.298000
1 4.865672
In [23]:
df.groupby("Outcome").agg({'Pregnancies':'max'})
Out[23]:
Pregnancies
Outcome
0 13
1 17
In [24]:
df.groupby("Outcome").agg({'Glucose':'mean'})
Out[24]:
Glucose
Outcome
0 109.980000
1 141.257463
In [25]:
df.groupby("Outcome").agg({'Glucose':'max'})
Out[25]:
Glucose
Outcome
0 197
1 199
In [26]:
# 'BloodPressure', 'SkinThickness', 'Insulin',
# 'BMI', 'DiabetesPedigreeFunction', 'Age'
#  groupby-> mean/max
In [27]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the count of each category in the 'Outcome' column
plt.figure(figsize=(12, 6))
sns.countplot(x='Outcome', data=df)
plt.title('Count of Outcome Categories')
plt.xlabel('Outcome')
plt.ylabel('Count')
f,ax = plt.subplots(1,2, figsize=(18,8))
df['Outcome'].value_counts().plot.pie(explode=[0,0.1],autopct = "%1.1f%%", ax=ax[0], shadow=True)
ax[0].set_title('target')
ax[0].set_ylabel('')
plt.show()
In [28]:
df.corr()
Out[28]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
Pregnancies 1.000000 0.129459 0.141282 -0.081672 -0.073535 0.017683 -0.033523 0.544341 0.221898
Glucose 0.129459 1.000000 0.152590 0.057328 0.331357 0.221071 0.137337 0.263514 0.466581
BloodPressure 0.141282 0.152590 1.000000 0.207371 0.088933 0.281805 0.041265 0.239528 0.065068
SkinThickness -0.081672 0.057328 0.207371 1.000000 0.436783 0.392573 0.183928 -0.113970 0.074752
Insulin -0.073535 0.331357 0.088933 0.436783 1.000000 0.197859 0.185071 -0.042163 0.130548
BMI 0.017683 0.221071 0.281805 0.392573 0.197859 1.000000 0.140647 0.036242 0.292695
DiabetesPedigreeFunction -0.033523 0.137337 0.041265 0.183928 0.185071 0.140647 1.000000 0.033561 0.173844
Age 0.544341 0.263514 0.239528 -0.113970 -0.042163 0.036242 0.033561 1.000000 0.238356
Outcome 0.221898 0.466581 0.065068 0.074752 0.130548 0.292695 0.173844 0.238356 1.000000
In [29]:
f,ax = plt.subplots(figsize=[20,15])
sns.heatmap(df.corr(), annot=True, fmt = '.2f', ax=ax, cmap='magma')
ax.set_title("Correlation Matrix", fontsize=20)
plt.show()
In [30]:
# EDA Part Completed
df.columns
Out[30]:
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')
In [31]:
df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age']] = df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age']].replace(0, np.NaN)
In [32]:
# Data preprocessing Part
df.isnull().sum()
Out[32]:
Pregnancies                 111
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64
In [33]:
df.head()
Out[33]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6.0 148.0 72.0 35.0 NaN 33.6 0.627 50 1
1 1.0 85.0 66.0 29.0 NaN 26.6 0.351 31 0
2 8.0 183.0 64.0 NaN NaN 23.3 0.672 32 1
3 1.0 89.0 66.0 23.0 94.0 28.1 0.167 21 0
4 NaN 137.0 40.0 35.0 168.0 43.1 2.288 33 1
In [34]:
import missingno as msno
msno.bar(df, color="orange")
Out[34]:
<Axes: >
In [35]:
#median
def median_target(var):   
    temp = df[df[var].notnull()]
    temp = temp[[var, 'Outcome']].groupby(['Outcome'])[[var]].median().reset_index()
    return temp
In [36]:
columns = df.columns
columns = columns.drop("Outcome")
for i in columns:
    median_target(i)
    df.loc[(df['Outcome'] == 0 ) & (df[i].isnull()), i] = median_target(i)[i][0]
    df.loc[(df['Outcome'] == 1 ) & (df[i].isnull()), i] = median_target(i)[i][1]
In [37]:
df.head()
Out[37]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6.0 148.0 72.0 35.0 169.5 33.6 0.627 50 1
1 1.0 85.0 66.0 29.0 102.5 26.6 0.351 31 0
2 8.0 183.0 64.0 32.0 169.5 23.3 0.672 32 1
3 1.0 89.0 66.0 23.0 94.0 28.1 0.167 21 0
4 5.0 137.0 40.0 35.0 168.0 43.1 2.288 33 1
In [38]:
df.isnull().sum()
Out[38]:
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
In [39]:
# pair plot
p = sns.pairplot(df, hue="Outcome")
In [40]:
#Data preprocessing
In [41]:
# Outlier Detection
# IQR+Q1
# 50%
# 24.65->25%+50%
# 24.65->25%
for feature in df:
    Q1 = df[feature].quantile(0.25)
    Q3 = df[feature].quantile(0.75)
    IQR = Q3-Q1
    lower = Q1-1.5*IQR
    upper = Q3+1.5*IQR
    if df[(df[feature]>upper)].any(axis=None):
        print(feature, "yes")
    else:
        print(feature, "no")
Pregnancies yes
Glucose no
BloodPressure yes
SkinThickness yes
Insulin yes
BMI yes
DiabetesPedigreeFunction yes
Age yes
Outcome no
In [42]:
#Boxplot to identify outliers
import warnings
warnings.filterwarnings("ignore")
for i in df.select_dtypes(include="number").columns:
    sns.boxplot(data=df,x=i)
    plt.show()
In [43]:
plt.figure(figsize=(8,7))
sns.boxplot(x= df["Insulin"], color="red")
Out[43]:
<Axes: xlabel='Insulin'>
In [44]:
Q1 = df.Insulin.quantile(0.25)
Q3 = df.Insulin.quantile(0.75)
IQR = Q3-Q1
lower = Q1-1.5*IQR
upper = Q3+1.5*IQR
df.loc[df['Insulin']>upper, "Insulin"] = upper
In [45]:
plt.figure(figsize=(8,7))
sns.boxplot(x= df["Insulin"], color="red")
Out[45]:
<Axes: xlabel='Insulin'>
In [46]:
# LOF
# local outlier factor
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(n_neighbors=10)
lof.fit_predict(df)
Out[46]:
array([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1, -1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1, -1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
       -1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1, -1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1])
In [47]:
df.head()
Out[47]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6.0 148.0 72.0 35.0 169.5 33.6 0.627 50 1
1 1.0 85.0 66.0 29.0 102.5 26.6 0.351 31 0
2 8.0 183.0 64.0 32.0 169.5 23.3 0.672 32 1
3 1.0 89.0 66.0 23.0 94.0 28.1 0.167 21 0
4 5.0 137.0 40.0 35.0 168.0 43.1 2.288 33 1
In [48]:
plt.figure(figsize=(8,7))
sns.boxplot(x= df["Pregnancies"], color="red")
Out[48]:
<Axes: xlabel='Pregnancies'>
In [49]:
df_scores = lof.negative_outlier_factor_
np.sort(df_scores)[0:20]
Out[49]:
array([-3.06509976, -2.38250393, -2.15557018, -2.11501347, -2.08356175,
       -1.95386655, -1.83559384, -1.74974237, -1.7330214 , -1.71017168,
       -1.70215105, -1.68722889, -1.64294601, -1.64180205, -1.61181746,
       -1.61067772, -1.60925053, -1.60214364, -1.59998552, -1.58761193])
In [50]:
thresold = np.sort(df_scores)[7]
In [51]:
thresold
Out[51]:
-1.7497423670960557
In [52]:
outlier = df_scores>thresold
In [53]:
df = df[outlier]
In [54]:
df.head()
Out[54]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6.0 148.0 72.0 35.0 169.5 33.6 0.627 50 1
1 1.0 85.0 66.0 29.0 102.5 26.6 0.351 31 0
2 8.0 183.0 64.0 32.0 169.5 23.3 0.672 32 1
3 1.0 89.0 66.0 23.0 94.0 28.1 0.167 21 0
4 5.0 137.0 40.0 35.0 168.0 43.1 2.288 33 1
In [55]:
df.shape
Out[55]:
(760, 9)
In [56]:
plt.figure(figsize=(8,7))
sns.boxplot(x= df["Pregnancies"], color="red")
Out[56]:
<Axes: xlabel='Pregnancies'>
In [57]:
#Feature Enginnering
NewBMI = pd.Series(["Underweight","Normal", "Overweight","Obesity 1", "Obesity 2", "Obesity 3"], dtype = "category")
In [58]:
NewBMI
Out[58]:
0    Underweight
1         Normal
2     Overweight
3      Obesity 1
4      Obesity 2
5      Obesity 3
dtype: category
Categories (6, object): ['Normal', 'Obesity 1', 'Obesity 2', 'Obesity 3', 'Overweight', 'Underweight']
In [59]:
df['NewBMI'] = NewBMI
df.loc[df["BMI"]<18.5, "NewBMI"] = NewBMI[0]
df.loc[(df["BMI"]>18.5) & df["BMI"]<=24.9, "NewBMI"] = NewBMI[1]
df.loc[(df["BMI"]>24.9) & df["BMI"]<=29.9, "NewBMI"] = NewBMI[2]
df.loc[(df["BMI"]>29.9) & df["BMI"]<=34.9, "NewBMI"] = NewBMI[3]
df.loc[(df["BMI"]>34.9) & df["BMI"]<=39.9, "NewBMI"] = NewBMI[4]
df.loc[df["BMI"]>39.9, "NewBMI"] = NewBMI[5]
In [60]:
df.head()
Out[60]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome NewBMI
0 6.0 148.0 72.0 35.0 169.5 33.6 0.627 50 1 Obesity 2
1 1.0 85.0 66.0 29.0 102.5 26.6 0.351 31 0 Obesity 2
2 8.0 183.0 64.0 32.0 169.5 23.3 0.672 32 1 Obesity 2
3 1.0 89.0 66.0 23.0 94.0 28.1 0.167 21 0 Obesity 2
4 5.0 137.0 40.0 35.0 168.0 43.1 2.288 33 1 Obesity 3
In [61]:
# if insulin>=16 & insuline<=166->normal
def set_insuline(row):
    if row["Insulin"]>=16 and row["Insulin"]<=166:
        return "Normal"
    else:
        return "Abnormal"
In [62]:
df = df.assign(NewInsulinScore=df.apply(set_insuline, axis=1))
In [63]:
df.head()
Out[63]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome NewBMI NewInsulinScore
0 6.0 148.0 72.0 35.0 169.5 33.6 0.627 50 1 Obesity 2 Abnormal
1 1.0 85.0 66.0 29.0 102.5 26.6 0.351 31 0 Obesity 2 Normal
2 8.0 183.0 64.0 32.0 169.5 23.3 0.672 32 1 Obesity 2 Abnormal
3 1.0 89.0 66.0 23.0 94.0 28.1 0.167 21 0 Obesity 2 Normal
4 5.0 137.0 40.0 35.0 168.0 43.1 2.288 33 1 Obesity 3 Abnormal
In [64]:
# Some intervals were determined according to the glucose variable and these were assigned categorical variables.
NewGlucose = pd.Series(["Low", "Normal", "Overweight", "Secret", "High"], dtype = "category")
df["NewGlucose"] = NewGlucose
df.loc[df["Glucose"] <= 70, "NewGlucose"] = NewGlucose[0]
df.loc[(df["Glucose"] > 70) & (df["Glucose"] <= 99), "NewGlucose"] = NewGlucose[1]
df.loc[(df["Glucose"] > 99) & (df["Glucose"] <= 126), "NewGlucose"] = NewGlucose[2]
df.loc[df["Glucose"] > 126 ,"NewGlucose"] = NewGlucose[3]
In [65]:
df.head()
Out[65]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome NewBMI NewInsulinScore NewGlucose
0 6.0 148.0 72.0 35.0 169.5 33.6 0.627 50 1 Obesity 2 Abnormal Secret
1 1.0 85.0 66.0 29.0 102.5 26.6 0.351 31 0 Obesity 2 Normal Normal
2 8.0 183.0 64.0 32.0 169.5 23.3 0.672 32 1 Obesity 2 Abnormal Secret
3 1.0 89.0 66.0 23.0 94.0 28.1 0.167 21 0 Obesity 2 Normal Normal
4 5.0 137.0 40.0 35.0 168.0 43.1 2.288 33 1 Obesity 3 Abnormal Secret
In [66]:
# One hot encoding
df = pd.get_dummies(df, columns = ["NewBMI", "NewInsulinScore", "NewGlucose"], drop_first=True)
In [67]:
df.head()
Out[67]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome NewBMI_Obesity 1 NewBMI_Obesity 2 NewBMI_Obesity 3 NewBMI_Overweight NewBMI_Underweight NewInsulinScore_Normal NewGlucose_Low NewGlucose_Normal NewGlucose_Overweight NewGlucose_Secret
0 6.0 148.0 72.0 35.0 169.5 33.6 0.627 50 1 False True False False False False False False False True
1 1.0 85.0 66.0 29.0 102.5 26.6 0.351 31 0 False True False False False True False True False False
2 8.0 183.0 64.0 32.0 169.5 23.3 0.672 32 1 False True False False False False False False False True
3 1.0 89.0 66.0 23.0 94.0 28.1 0.167 21 0 False True False False False True False True False False
4 5.0 137.0 40.0 35.0 168.0 43.1 2.288 33 1 False False True False False False False False False True
In [68]:
# Specify the columns to be converted to 0 or 1
columns_to_convert = ['NewBMI_Obesity 1', 'NewBMI_Obesity 2', 'NewBMI_Obesity 3', 'NewBMI_Overweight', 'NewBMI_Underweight', 'NewInsulinScore_Normal', 'NewGlucose_Low', 'NewGlucose_Normal', 'NewGlucose_Overweight', 'NewGlucose_Secret']
for column in columns_to_convert:
    df[column] = df[column].map({True: 1, False: 0})
In [69]:
df.head()
Out[69]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome NewBMI_Obesity 1 NewBMI_Obesity 2 NewBMI_Obesity 3 NewBMI_Overweight NewBMI_Underweight NewInsulinScore_Normal NewGlucose_Low NewGlucose_Normal NewGlucose_Overweight NewGlucose_Secret
0 6.0 148.0 72.0 35.0 169.5 33.6 0.627 50 1 0 1 0 0 0 0 0 0 0 1
1 1.0 85.0 66.0 29.0 102.5 26.6 0.351 31 0 0 1 0 0 0 1 0 1 0 0
2 8.0 183.0 64.0 32.0 169.5 23.3 0.672 32 1 0 1 0 0 0 0 0 0 0 1
3 1.0 89.0 66.0 23.0 94.0 28.1 0.167 21 0 0 1 0 0 0 1 0 1 0 0
4 5.0 137.0 40.0 35.0 168.0 43.1 2.288 33 1 0 0 1 0 0 0 0 0 0 1
In [70]:
df.columns
Out[70]:
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome', 'NewBMI_Obesity 1',
       'NewBMI_Obesity 2', 'NewBMI_Obesity 3', 'NewBMI_Overweight',
       'NewBMI_Underweight', 'NewInsulinScore_Normal', 'NewGlucose_Low',
       'NewGlucose_Normal', 'NewGlucose_Overweight', 'NewGlucose_Secret'],
      dtype='object')
In [71]:
categorical_df = df[['NewBMI_Obesity 1',
       'NewBMI_Obesity 2', 'NewBMI_Obesity 3', 'NewBMI_Overweight',
       'NewBMI_Underweight', 'NewInsulinScore_Normal', 'NewGlucose_Low',
       'NewGlucose_Normal', 'NewGlucose_Overweight', 'NewGlucose_Secret']]
In [72]:
categorical_df.head()
Out[72]:
NewBMI_Obesity 1 NewBMI_Obesity 2 NewBMI_Obesity 3 NewBMI_Overweight NewBMI_Underweight NewInsulinScore_Normal NewGlucose_Low NewGlucose_Normal NewGlucose_Overweight NewGlucose_Secret
0 0 1 0 0 0 0 0 0 0 1
1 0 1 0 0 0 1 0 1 0 0
2 0 1 0 0 0 0 0 0 0 1
3 0 1 0 0 0 1 0 1 0 0
4 0 0 1 0 0 0 0 0 0 1
In [73]:
df.head()
Out[73]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome NewBMI_Obesity 1 NewBMI_Obesity 2 NewBMI_Obesity 3 NewBMI_Overweight NewBMI_Underweight NewInsulinScore_Normal NewGlucose_Low NewGlucose_Normal NewGlucose_Overweight NewGlucose_Secret
0 6.0 148.0 72.0 35.0 169.5 33.6 0.627 50 1 0 1 0 0 0 0 0 0 0 1
1 1.0 85.0 66.0 29.0 102.5 26.6 0.351 31 0 0 1 0 0 0 1 0 1 0 0
2 8.0 183.0 64.0 32.0 169.5 23.3 0.672 32 1 0 1 0 0 0 0 0 0 0 1
3 1.0 89.0 66.0 23.0 94.0 28.1 0.167 21 0 0 1 0 0 0 1 0 1 0 0
4 5.0 137.0 40.0 35.0 168.0 43.1 2.288 33 1 0 0 1 0 0 0 0 0 0 1
In [74]:
df.columns
Out[74]:
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome', 'NewBMI_Obesity 1',
       'NewBMI_Obesity 2', 'NewBMI_Obesity 3', 'NewBMI_Overweight',
       'NewBMI_Underweight', 'NewInsulinScore_Normal', 'NewGlucose_Low',
       'NewGlucose_Normal', 'NewGlucose_Overweight', 'NewGlucose_Secret'],
      dtype='object')
In [75]:
y=df['Outcome']
X=df.drop(['Outcome','NewBMI_Obesity 1',
       'NewBMI_Obesity 2', 'NewBMI_Obesity 3', 'NewBMI_Overweight',
       'NewBMI_Underweight', 'NewInsulinScore_Normal', 'NewGlucose_Low',
       'NewGlucose_Normal', 'NewGlucose_Overweight', 'NewGlucose_Secret'], axis=1)
In [76]:
cols = X.columns
index = X.index
In [77]:
X.head()
Out[77]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
0 6.0 148.0 72.0 35.0 169.5 33.6 0.627 50
1 1.0 85.0 66.0 29.0 102.5 26.6 0.351 31
2 8.0 183.0 64.0 32.0 169.5 23.3 0.672 32
3 1.0 89.0 66.0 23.0 94.0 28.1 0.167 21
4 5.0 137.0 40.0 35.0 168.0 43.1 2.288 33
In [78]:
from sklearn.preprocessing import RobustScaler
transformer = RobustScaler().fit(X)
X=transformer.transform(X)
X=pd.DataFrame(X, columns = cols, index = index)
In [79]:
X.head()
Out[79]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
0 0.75 0.775 0.000 1.000000 1.000000 0.177778 0.669707 1.235294
1 -0.50 -0.800 -0.375 0.142857 0.000000 -0.600000 -0.049511 0.117647
2 1.25 1.650 -0.500 0.571429 1.000000 -0.966667 0.786971 0.176471
3 -0.50 -0.700 -0.375 -0.714286 -0.126866 -0.433333 -0.528990 -0.470588
4 0.50 0.500 -2.000 1.000000 0.977612 1.233333 4.998046 0.235294
In [80]:
X = pd.concat([X, categorical_df], axis=1)
In [81]:
X.head()
Out[81]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age NewBMI_Obesity 1 NewBMI_Obesity 2 NewBMI_Obesity 3 NewBMI_Overweight NewBMI_Underweight NewInsulinScore_Normal NewGlucose_Low NewGlucose_Normal NewGlucose_Overweight NewGlucose_Secret
0 0.75 0.775 0.000 1.000000 1.000000 0.177778 0.669707 1.235294 0 1 0 0 0 0 0 0 0 1
1 -0.50 -0.800 -0.375 0.142857 0.000000 -0.600000 -0.049511 0.117647 0 1 0 0 0 1 0 1 0 0
2 1.25 1.650 -0.500 0.571429 1.000000 -0.966667 0.786971 0.176471 0 1 0 0 0 0 0 0 0 1
3 -0.50 -0.700 -0.375 -0.714286 -0.126866 -0.433333 -0.528990 -0.470588 0 1 0 0 0 1 0 1 0 0
4 0.50 0.500 -2.000 1.000000 0.977612 1.233333 4.998046 0.235294 0 0 1 0 0 0 0 0 0 1
In [124]:
x=df.drop('Outcome', axis=1) 
y=df['Outcome']
In [125]:
X_train, X_test, y_train , y_test = train_test_split(X,y, test_size=0.2, random_state=0)
In [126]:
LogisticRegression() 
SVC()
RandomForestClassifier(n_estimators=1000,class_weight='balanced')
GradientBoostingClassifier(n_estimators=1000)
Out[126]:
GradientBoostingClassifier(n_estimators=1000)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(n_estimators=1000)
In [127]:
scaler =StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
In [128]:
# Machine Learning Algo
# Machine Learning models
# Logistic Regreesion
In [129]:
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
Out[129]:
LogisticRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression()
In [130]:
y_pred = log_reg.predict(X_test)
In [131]:
accuracy_score(y_train, log_reg.predict(X_train))
Out[131]:
0.8470394736842105
In [132]:
log_reg_acc = accuracy_score(y_test, log_reg.predict(X_test))
In [133]:
from sklearn.metrics import confusion_matrix, accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns

def predict_and_plot(model, inputs, targets, name=''):
    preds = model.predict(inputs)
    accuracy = accuracy_score(targets, preds)
    print("Accuracy: {:.2f}%".format(accuracy * 100))
    
    cf = confusion_matrix(targets, preds, normalize='true')
    plt.figure()
    sns.heatmap(cf, annot=True)
    plt.xlabel('Prediction')
    plt.ylabel('Target')
    plt.title('{} Confusion Matrix'.format(name))
    
    return preds

# Predict and plot on the training data
train_preds = predict_and_plot(log_reg, X_train, y_train, 'Train')

# Predict and plot on the validation data
val_preds = predict_and_plot(log_reg, X_test, y_test, 'Validation')
Accuracy: 84.70%
Accuracy: 89.47%
In [134]:
confusion_matrix(y_test, y_pred)
Out[134]:
array([[88, 10],
       [ 6, 48]], dtype=int64)
In [135]:
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.94      0.90      0.92        98
           1       0.83      0.89      0.86        54

    accuracy                           0.89       152
   macro avg       0.88      0.89      0.89       152
weighted avg       0.90      0.89      0.90       152

In [136]:
# KNN
In [137]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(accuracy_score(y_train, knn.predict(X_train)))
knn_acc = accuracy_score(y_test, knn.predict(X_test))
print(accuracy_score(y_test, knn.predict(X_test)))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
0.875
0.881578947368421
[[88 10]
 [ 8 46]]
              precision    recall  f1-score   support

           0       0.92      0.90      0.91        98
           1       0.82      0.85      0.84        54

    accuracy                           0.88       152
   macro avg       0.87      0.87      0.87       152
weighted avg       0.88      0.88      0.88       152

In [138]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

knn_model = KNeighborsClassifier(n_neighbors=5)

knn_model.fit(X_train, y_train)

y_train_pred = knn_model.predict(X_train)

y_val_pred = knn_model.predict(X_val)

train_accuracy = accuracy_score(y_train, y_train_pred)
val_accuracy = accuracy_score(y_val, y_val_pred)

print("Training Accuracy:", train_accuracy)
print("Validation Accuracy:", val_accuracy)

confusion = confusion_matrix(y_val, y_val_pred)

plt.figure(figsize=(6, 4))
sns.heatmap(confusion, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (Validation)')
plt.show()
Training Accuracy: 0.8717105263157895
Validation Accuracy: 0.8421052631578947
In [139]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

param_grid = {
    'n_neighbors': [1, 3, 5, 7, 9]  
}

knn_model = KNeighborsClassifier()

grid_search = GridSearchCV(knn_model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_

y_train_pred = best_model.predict(X_train)

y_val_pred = best_model.predict(X_val)

train_accuracy = accuracy_score(y_train, y_train_pred)
val_accuracy = accuracy_score(y_val, y_val_pred)

print("Training Accuracy with Best Hyperparameters:", train_accuracy)
print("Validation Accuracy with Best Hyperparameters:", val_accuracy)
Training Accuracy with Best Hyperparameters: 0.8634868421052632
Validation Accuracy with Best Hyperparameters: 0.8486842105263158
In [140]:
# SVM
svc = SVC(probability=True)
parameter = {
    "gamma":[0.0001, 0.001, 0.01, 0.1],
    'C': [0.01, 0.05,0.5, 0.01, 1, 10, 15, 20]
}
grid_search = GridSearchCV(svc, parameter)
grid_search.fit(X_train, y_train)
Out[140]:
GridSearchCV(estimator=SVC(probability=True),
             param_grid={'C': [0.01, 0.05, 0.5, 0.01, 1, 10, 15, 20],
                         'gamma': [0.0001, 0.001, 0.01, 0.1]})
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(estimator=SVC(probability=True),
             param_grid={'C': [0.01, 0.05, 0.5, 0.01, 1, 10, 15, 20],
                         'gamma': [0.0001, 0.001, 0.01, 0.1]})
SVC(probability=True)
SVC(probability=True)
In [97]:
# best_parameter
grid_search.best_params_
Out[97]:
{'C': 10, 'gamma': 0.1}
In [98]:
grid_search.best_score_
Out[98]:
0.8618615363771847
In [99]:
svc = SVC(C=10, gamma = 0.01, probability=True)
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
print(accuracy_score(y_train, svc.predict(X_train)))
svc_acc = accuracy_score(y_test, svc.predict(X_test))
print(accuracy_score(y_test, svc.predict(X_test)))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
0.8618421052631579
0.9013157894736842
[[88 10]
 [ 5 49]]
              precision    recall  f1-score   support

           0       0.95      0.90      0.92        98
           1       0.83      0.91      0.87        54

    accuracy                           0.90       152
   macro avg       0.89      0.90      0.89       152
weighted avg       0.91      0.90      0.90       152

In [100]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

svm_model = SVC(kernel='linear')

svm_model.fit(X_train, y_train)

y_train_pred = svm_model.predict(X_train)

y_val_pred = svm_model.predict(X_val)

train_accuracy = accuracy_score(y_train, y_train_pred)

val_accuracy = accuracy_score(y_val, y_val_pred)

print("Training Accuracy:", train_accuracy)
print("Validation Accuracy:", val_accuracy)

train_confusion = confusion_matrix(y_train, y_train_pred)
val_confusion = confusion_matrix(y_val, y_val_pred)

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.heatmap(train_confusion, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (Training)')

plt.subplot(1, 2, 2)
sns.heatmap(val_confusion, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (Validation)')
plt.show()
Training Accuracy: 0.837171052631579
Validation Accuracy: 0.8552631578947368
In [101]:
# Decision Tree
In [102]:
DT = DecisionTreeClassifier()
DT.fit(X_train, y_train)
y_pred = DT.predict(X_test)
print(accuracy_score(y_train, DT.predict(X_train)))

print(accuracy_score(y_test, DT.predict(X_test)))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
1.0
0.7763157894736842
[[83 15]
 [19 35]]
              precision    recall  f1-score   support

           0       0.81      0.85      0.83        98
           1       0.70      0.65      0.67        54

    accuracy                           0.78       152
   macro avg       0.76      0.75      0.75       152
weighted avg       0.77      0.78      0.77       152

In [103]:
# hyperparameter tuning of dt
grid_param = {
    'criterion':['gini','entropy'],
    'max_depth' :  [3,5,7,10],
    'splitter' : ['best','radom'],
    'min_samples_leaf':[1,2,3,5,7], 
    'min_samples_split':[1,2,3,5,7], 
    'max_features':['auto','sqrt','log2']
}
grid_search_dt = GridSearchCV(DT, grid_param, cv=50, n_jobs=-1, verbose = 1)
grid_search_dt.fit(X_train, y_train)
Fitting 50 folds for each of 1200 candidates, totalling 60000 fits
Out[103]:
GridSearchCV(cv=50, estimator=DecisionTreeClassifier(), n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [3, 5, 7, 10],
                         'max_features': ['auto', 'sqrt', 'log2'],
                         'min_samples_leaf': [1, 2, 3, 5, 7],
                         'min_samples_split': [1, 2, 3, 5, 7],
                         'splitter': ['best', 'radom']},
             verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=50, estimator=DecisionTreeClassifier(), n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [3, 5, 7, 10],
                         'max_features': ['auto', 'sqrt', 'log2'],
                         'min_samples_leaf': [1, 2, 3, 5, 7],
                         'min_samples_split': [1, 2, 3, 5, 7],
                         'splitter': ['best', 'radom']},
             verbose=1)
DecisionTreeClassifier()
DecisionTreeClassifier()
In [104]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [5],
    'min_samples_split': [3],
    'min_samples_leaf': [7],
    'criterion': ['gini', 'entropy']  
}

decision_tree_model = DecisionTreeClassifier(random_state=42)

grid_search = GridSearchCV(decision_tree_model, param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_

best_model.fit(X_train, y_train)

train_accuracy = best_model.score(X_train, y_train)
val_accuracy = best_model.score(X_test, y_test)

print("Training Accuracy:", train_accuracy)
print("Validation Accuracy:", val_accuracy)
Training Accuracy: 0.9210526315789473
Validation Accuracy: 0.7763157894736842
In [105]:
grid_search_dt.best_params_
Out[105]:
{'criterion': 'entropy',
 'max_depth': 7,
 'max_features': 'log2',
 'min_samples_leaf': 3,
 'min_samples_split': 3,
 'splitter': 'best'}
In [106]:
grid_search_dt.best_score_
Out[106]:
0.8821794871794871
In [107]:
DT = grid_search_dt.best_estimator_
y_pred = DT.predict(X_test)
print(accuracy_score(y_train, DT.predict(X_train)))
dt_acc = accuracy_score(y_test, DT.predict(X_test))
print(accuracy_score(y_test, DT.predict(X_test)))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
0.8799342105263158
0.8618421052631579
[[92  6]
 [15 39]]
              precision    recall  f1-score   support

           0       0.86      0.94      0.90        98
           1       0.87      0.72      0.79        54

    accuracy                           0.86       152
   macro avg       0.86      0.83      0.84       152
weighted avg       0.86      0.86      0.86       152

In [108]:
rand_clf = RandomForestClassifier(criterion = 'entropy', max_depth = 15, max_features = 0.75, min_samples_leaf = 2, min_samples_split = 3, n_estimators = 130)
rand_clf.fit(X_train, y_train)
Out[108]:
RandomForestClassifier(criterion='entropy', max_depth=15, max_features=0.75,
                       min_samples_leaf=2, min_samples_split=3,
                       n_estimators=130)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(criterion='entropy', max_depth=15, max_features=0.75,
                       min_samples_leaf=2, min_samples_split=3,
                       n_estimators=130)
In [109]:
y_pred = rand_clf.predict(X_test)
In [110]:
y_pred = rand_clf.predict(X_test)
print(accuracy_score(y_train, rand_clf.predict(X_train)))
rand_acc = accuracy_score(y_test, rand_clf.predict(X_test))
print(accuracy_score(y_test, rand_clf.predict(X_test)))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
0.9967105263157895
0.881578947368421
[[92  6]
 [12 42]]
              precision    recall  f1-score   support

           0       0.88      0.94      0.91        98
           1       0.88      0.78      0.82        54

    accuracy                           0.88       152
   macro avg       0.88      0.86      0.87       152
weighted avg       0.88      0.88      0.88       152

In [111]:
gbc = GradientBoostingClassifier()

parameters = {
    'loss': ['deviance', 'exponential'],
    'learning_rate': [0.001, 0.1, 1, 10],
    'n_estimators': [100, 150, 180, 200]
}

grid_search_gbc = GridSearchCV(gbc, parameters, cv = 10, n_jobs = -1, verbose = 1)
grid_search_gbc.fit(X_train, y_train)
Fitting 10 folds for each of 32 candidates, totalling 320 fits
Out[111]:
GridSearchCV(cv=10, estimator=GradientBoostingClassifier(), n_jobs=-1,
             param_grid={'learning_rate': [0.001, 0.1, 1, 10],
                         'loss': ['deviance', 'exponential'],
                         'n_estimators': [100, 150, 180, 200]},
             verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=10, estimator=GradientBoostingClassifier(), n_jobs=-1,
             param_grid={'learning_rate': [0.001, 0.1, 1, 10],
                         'loss': ['deviance', 'exponential'],
                         'n_estimators': [100, 150, 180, 200]},
             verbose=1)
GradientBoostingClassifier()
GradientBoostingClassifier()
In [112]:
grid_search_gbc.best_params_
Out[112]:
{'learning_rate': 0.1, 'loss': 'exponential', 'n_estimators': 100}
In [113]:
grid_search_gbc.best_score_
Out[113]:
0.8981147540983606
In [114]:
gbc = GradientBoostingClassifier(learning_rate = 0.1, loss = 'exponential', n_estimators = 150)
gbc.fit(X_train, y_train)
Out[114]:
GradientBoostingClassifier(loss='exponential', n_estimators=150)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(loss='exponential', n_estimators=150)
In [115]:
gbc = grid_search_gbc.best_estimator_
y_pred = gbc.predict(X_test)
print(accuracy_score(y_train, gbc.predict(X_train)))
gbc_acc = accuracy_score(y_test, gbc.predict(X_test))
print(accuracy_score(y_test, gbc.predict(X_test)))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
0.9802631578947368
0.8421052631578947
[[94  4]
 [20 34]]
              precision    recall  f1-score   support

           0       0.82      0.96      0.89        98
           1       0.89      0.63      0.74        54

    accuracy                           0.84       152
   macro avg       0.86      0.79      0.81       152
weighted avg       0.85      0.84      0.83       152

In [116]:
from xgboost import XGBClassifier 
xgb = XGBClassifier(objective = 'binary:logistic', learning_rate = 0.01, max_depth = 10, n_estimators = 180)

xgb.fit(X_train, y_train)
Out[116]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.01, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=10, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=180, n_jobs=None,
              num_parallel_tree=None, random_state=None, ...)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.01, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=10, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=180, n_jobs=None,
              num_parallel_tree=None, random_state=None, ...)
In [117]:
y_pred = xgb.predict(X_test)
print(accuracy_score(y_train, xgb.predict(X_train)))
xgb_acc = accuracy_score(y_test, xgb.predict(X_test))
print(accuracy_score(y_test, xgb.predict(X_test)))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
0.9819078947368421
0.8552631578947368
[[95  3]
 [19 35]]
              precision    recall  f1-score   support

           0       0.83      0.97      0.90        98
           1       0.92      0.65      0.76        54

    accuracy                           0.86       152
   macro avg       0.88      0.81      0.83       152
weighted avg       0.86      0.86      0.85       152

In [118]:
# Model Comparison
models = pd.DataFrame({
    'Model': ['Logistic Regression', 'KNN', 'SVM', 'Decision Tree Classifier', 'Random Forest Classifier', 'Gradient Boosting Classifier', 'XgBoost'],
    'Score': [100*round(log_reg_acc,4), 100*round(knn_acc,4), 100*round(svc_acc,4), 100*round(dt_acc,4), 100*round(rand_acc,4), 
              100*round(gbc_acc,4), 100*round(xgb_acc,4)]
})
models.sort_values(by = 'Score', ascending = False)
Out[118]:
Model Score
2 SVM 90.13
0 Logistic Regression 89.47
1 KNN 88.16
4 Random Forest Classifier 88.16
3 Decision Tree Classifier 86.18
6 XgBoost 85.53
5 Gradient Boosting Classifier 84.21
In [119]:
import pickle
model = gbc_acc
pickle.dump(model, open("diabetes.pkl",'wb'))
In [120]:
from sklearn import metrics
plt.figure(figsize=(8,5))
models = [
{
    'label': 'LR',
    'model': log_reg,
},
{
    'label': 'DT',
    'model': DT,
},
{
    'label': 'SVM',
    'model': svc,
},
{
    'label': 'KNN',
    'model': knn,
},
{
    'label': 'XGBoost',
    'model': xgb,
},
{
    'label': 'RF',
    'model': rand_clf,
},
{
    'label': 'GBDT',
    'model': gbc,
}
]
for m in models:
    model = m['model'] 
    model.fit(X_train, y_train) 
    y_pred=model.predict(X_test) 
    fpr1, tpr1, thresholds = metrics.roc_curve(y_test, model.predict_proba(X_test)[:,1])
    auc = metrics.roc_auc_score(y_test,model.predict(X_test))
    plt.plot(fpr1, tpr1, label='%s - ROC (area = %0.2f)' % (m['label'], auc))

plt.plot([0, 1], [0, 1],'r--')
plt.xlim([-0.01, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('1 - Specificity (False Positive Rate)', fontsize=12)
plt.ylabel('Sensitivity (True Positive Rate)', fontsize=12)
plt.title('ROC - Diabetes Prediction', fontsize=12)
plt.legend(loc="lower right", fontsize=12)
plt.savefig("roc_diabetes.jpeg", format='jpeg', dpi=400, bbox_inches='tight')
plt.show()
In [121]:
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt
models = [
{
    'label': 'LR',
    'model': log_reg,
},
{
    'label': 'DT',
    'model': DT,
},
{
    'label': 'SVM',
    'model': svc,
},
{
    'label': 'KNN',
    'model': knn,
},
{
    'label': 'XGBoost',
    'model': xgb,
},
{
    'label': 'RF',
    'model': rand_clf,
},
{
    'label': 'GBDT',
    'model': gbc,
}
]
means_roc = []
means_accuracy = [100*round(log_reg_acc,4), 100*round(dt_acc,4), 100*round(svc_acc,4), 100*round(knn_acc,4), 100*round(xgb_acc,4), 
                  100*round(rand_acc,4), 100*round(gbc_acc,4)]

for m in models:
    model = m['model'] 
    model.fit(X_train, y_train) 
    y_pred=model.predict(X_test) 
    fpr1, tpr1, thresholds = metrics.roc_curve(y_test, model.predict_proba(X_test)[:,1])
    auc = metrics.roc_auc_score(y_test,model.predict(X_test))
    auc = 100*round(auc,4)
    means_roc.append(auc)

print(means_accuracy)
print(means_roc)

# data to plot
n_groups = 7
means_accuracy = tuple(means_accuracy)
means_roc = tuple(means_roc)

# create plot
fig, ax = plt.subplots(figsize=(8,5))
index = np.arange(n_groups)
bar_width = 0.35
opacity = 0.8

rects1 = plt.bar(index, means_accuracy, bar_width,
alpha=opacity,
color='mediumpurple',
label='Accuracy (%)')

rects2 = plt.bar(index + bar_width, means_roc, bar_width,
alpha=opacity,
color='rebeccapurple',
label='ROC (%)')

plt.xlim([-1, 8])
plt.ylim([60, 95])

plt.title('Performance Evaluation - Diabetes Prediction', fontsize=12)
plt.xticks(index, ('   LR', '   DT', '   SVM', '   KNN', 'XGBoost' , '   RF', '   GBDT'), rotation=40, ha='center', fontsize=12)
plt.legend(loc="upper right", fontsize=10)
plt.savefig("PE_diabetes.jpeg", format='jpeg', dpi=400, bbox_inches='tight')
plt.show()
[89.47, 86.18, 90.13, 88.16000000000001, 85.53, 88.16000000000001, 84.21]
[87.11, 78.03999999999999, 90.27, 89.44, 80.88, 83.47, 79.44]

Model deployment¶

In [146]:
import pickle
filename='diabetes.sav'
pickle.dump(model, open(filename,'wb'))

Conclusion¶

I have done the "Diabetes Prediction" task. Originally, the dataset had 769 records. Besides, creating 7 different models and tuning their parameters were very useful for evaluating their performance to decide which one is the most effective for predicting diabetes disease. As mentioned above, because of the unique characteristics of the medical industry, correctly predicting diseases such as diabetes is crucial so choosing which model provided the highest value of Recall should be on the top of priority, also in the feature engineering i create new columns "NewBMI_Obesity1','NewBMI_Obesity2','NewBMI_Obesity'NewBMI_Overweight','NewBMI_Underweight','NewInsulinScore_Normal','NewGlucose _Low','NewGlucose_Normal''NewGlucose_Overweight', 'NewGlucose_Secret'], enhance the predictive power of the dataset. Finally, as showed in the model score the Gradient Boosting Classifier provided the highest score (91.45%),

Future Possible Work¶

  • Betterment of results using different hyperparameters for tuning
  • Implementing more models to gain better results
  • Using the same method to predict response